[V0] Correct CUDA Graph capture for encoder-decoder models #22630
Conversation
This commit addresses a bug where CUDA Graph failed to provide performance benefits for Whisper-style encoder-decoder models. The previous implementation capped `max_seq_len_to_capture` at the model's maximum sequence length, which for Whisper reflects only the decoder (448 tokens), while the encoder's max sequence length (1500) is significantly larger. This mismatch left the captured graph too small for the encoder's operations, so the feature was never properly leveraged. The fix updates the capture logic to determine the maximum sequence length by considering both the encoder and the decoder. With the captured graph large enough to cover both components, CUDA Graph can now be used for these models, yielding improved inference performance.
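For context, the two limits cited above correspond to standard fields of Whisper's Hugging Face config; a quick way to confirm them (an illustration only, not part of this PR) is:

```python
# Illustrative check of Whisper's encoder/decoder limits via transformers.
# Assumes network access to huggingface.co to fetch the config.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("openai/whisper-large-v3-turbo")
print(cfg.max_source_positions)  # 1500 -> encoder (audio) positions
print(cfg.max_target_positions)  # 448  -> decoder (text) positions, i.e. vLLM's max_model_len
```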
Code Review
This pull request fixes a bug with CUDA Graph capture for encoder-decoder models. The change prevents `max_seq_len_to_capture` from being incorrectly capped by a smaller model length, which is correct. I've suggested a more robust implementation that explicitly considers both encoder and decoder maximum lengths, making the fix more resilient and less dependent on default values.
vllm/config/__init__.py
```python
if not self.is_encoder_decoder:
    self.max_seq_len_to_capture = min(self.max_seq_len_to_capture,
                                      self.max_model_len)
```
While this change correctly addresses the issue for encoder-decoder models by not capping `max_seq_len_to_capture`, it relies on the default value being sufficiently large. This could be fragile if a user sets a smaller `max_seq_len_to_capture` or for models with very large encoder sequence lengths.
A more robust approach would be to explicitly determine the maximum length by considering both encoder and decoder sequence lengths, and then use that to cap `max_seq_len_to_capture`. This aligns better with the PR's goal to 'correctly determine the max sequence length'.
Consider this alternative implementation:
```diff
-if not self.is_encoder_decoder:
-    self.max_seq_len_to_capture = min(self.max_seq_len_to_capture,
-                                      self.max_model_len)
+max_len = self.max_model_len
+if self.is_encoder_decoder:
+    max_len = max(
+        max_len, getattr(self.hf_config, "max_source_positions", 0))
+self.max_seq_len_to_capture = min(self.max_seq_len_to_capture, max_len)
```
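To make the suggested logic concrete, here is a small self-contained sketch with Whisper-style numbers plugged in (the 8192 default for `max_seq_len_to_capture` is an assumption about vLLM's default; the other values come from the discussion above):

```python
# Plain-Python walk-through of the suggested capping logic for a Whisper-style
# model; values are hard-coded for illustration, not read from a real config.
max_model_len = 448             # decoder limit (Whisper's max_target_positions)
max_source_positions = 1500     # encoder limit (Whisper's max_source_positions)
max_seq_len_to_capture = 8192   # assumed default capture size

is_encoder_decoder = True
max_len = max_model_len
if is_encoder_decoder:
    # Raise the cap so the captured graph also covers the encoder.
    max_len = max(max_len, max_source_positions)   # -> 1500
max_seq_len_to_capture = min(max_seq_len_to_capture, max_len)
print(max_seq_len_to_capture)   # 1500: large enough for both encoder and decoder
```

The `getattr(..., "max_source_positions", 0)` in the suggestion keeps this safe for encoder-decoder configs that lack the attribute, in which case the cap simply falls back to `max_model_len`.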
Have you considered the impact of this change on mllama? It would be great if you could run benchmarks on both Whisper and mllama.
Thanks for the suggestion. I have verified the impact on the Whisper-large-v3-turbo model using NVIDIA H20e GPUs. The change reduced the average inference time from approximately 320ms to 190ms for the same audio, which is a significant performance gain. This updated Whisper service has been running successfully in our production environment under high load for several weeks. Regarding mllama, I haven't used it before, but I will start some basic performance and load tests to evaluate the impact.
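For anyone wanting to reproduce a rough version of this measurement, a minimal sketch in the spirit of vLLM's offline Whisper example might look like the following (the model name, prompt format, `AudioAsset` sample, and decoding settings are assumptions, not the exact benchmark script used above):

```python
# Rough latency check for Whisper transcription with vLLM's offline API.
# This is a sketch, not the benchmark harness used for the numbers above.
import time

from vllm import LLM, SamplingParams
from vllm.assets.audio import AudioAsset

llm = LLM(model="openai/whisper-large-v3-turbo", max_model_len=448)

prompt = {
    "prompt": "<|startoftranscript|>",
    "multi_modal_data": {"audio": AudioAsset("mary_had_lamb").audio_and_sample_rate},
}
params = SamplingParams(temperature=0, max_tokens=200)

llm.generate([prompt], params)          # warm-up (includes CUDA Graph capture)
start = time.perf_counter()
for _ in range(10):
    llm.generate([prompt], params)
print(f"avg latency: {(time.perf_counter() - start) / 10 * 1000:.1f} ms")
```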
re: @DarkLight1337 Thanks again for the suggestion and for your patience. I have completed the benchmarks for mllama and can now share the full findings. For the meta-llama/Llama-3.2-11B-Vision-Instruct model, tested on NVIDIA H20e GPUs, the performance remained stable. This stability is expected. As a reminder, this fix provided a significant performance boost for Whisper-large-v3-turbo, reducing average latency from ~320ms to ~190ms. This change is stable and has been running in our production environment for weeks. I believe this change is valuable for models like Whisper, where the encoder-decoder sequence length mismatch prevents CUDA Graph from being fully utilized.
Ok nice, LGTM then, thanks for optimizing this!
Thanks a lot for fixing this, this is great!
I feel this feature could use a unit test to make sure the V1 port #21088 correctly works with it though.
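A possible shape for such a test, sketched under the assumption that `EngineArgs.create_engine_config()` exposes the resulting `ModelConfig` and that the Whisper config can be downloaded, would be to assert that the capture size is never smaller than the encoder length:

```python
# Sketch of a regression test: for encoder-decoder models, the captured graph
# size should be at least the encoder sequence length so that encoder and
# cross-attention shapes fit inside the captured graph.
from vllm.engine.arg_utils import EngineArgs


def test_capture_size_covers_whisper_encoder():
    config = EngineArgs(model="openai/whisper-large-v3-turbo").create_engine_config()
    model_config = config.model_config

    encoder_len = getattr(model_config.hf_config, "max_source_positions", 0)
    assert encoder_len == 1500
    assert model_config.max_seq_len_to_capture >= encoder_len
```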
PTAL at the failing Worker test
Head branch was pushed to by a user without write access
re: @DarkLight1337
…ect#22630) Signed-off-by: Paul Pak <[email protected]>